These slides introduce the basics of bivariate OLS regression using cross-sectional data. We’ll analyze how infant mortality rates (IMR) vary across countries based on their political systems and economic development. Until the end, we’ll focus on bivariate models. The goal is just to think through some of the elements of the model and questions we face when modeling - you’ll benefit from having reviewed the matrix derivation slides, and the thinking about data slides.
Data Exploration
Let’s examine the dataset’s structure and summary statistics. The data include one observation for each country, and measures of infant mortality, GDP per capit and its natural log, deaths from natural disasters, whether there’s systematic censorship, and the polity scale. Here are a few ways to look at the data.
The estimate of \(\beta_0\) is the mean of the dependent variable, IMR. The constant-only model is a simple way to understand the data’s central tendency.
summary(censor$IMR)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.388 14.934 45.865 55.946 90.438 191.787
Political System and Infant Mortality
Now we’ll examine how a country’s political system (measured by the Polity score) affects infant mortality. The Polity score ranges from -10 (most autocratic) to +10 (most democratic). Let’s look at the Polity variable:
Polity is categorical and ordered. It’s commonly (mis)treated as a continuous variable in the literature - we’ll start there:
code
m1 <-lm(IMR ~ polity, data=censor)summary(m1)
Call:
lm(formula = IMR ~ polity, data = censor)
Residuals:
Min 1Q Median 3Q Max
-76.97 -31.95 -11.67 35.30 120.50
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.4898 4.0820 16.778 < 2e-16 ***
polity -2.7999 0.5431 -5.155 7.84e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.22 on 151 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.1496, Adjusted R-squared: 0.144
F-statistic: 26.57 on 1 and 151 DF, p-value: 7.836e-07
The estimated coefficient of -2.7999 suggests that, on average, each unit increase in the Polity score is associated with a 2.7999 decrease in the IMR; the estimate is statistically different from zero.
Let’s take a brief detour to think about the errors in our model. First, let’s compute the sum of squared residuals, \(\sigma^2\), and the residual standard error. Remember the residuals are given by \(y - \hat{y}\); \(\sigma^2\) is the sum of squared residuals divided by \(n-k-1\); and the residual standard error is the square root of \(\sigma^2\).
Comparing this to the estimates above, you’ll see our estimate of the RSE is the same as R’s estimate. This is the average residual in the model in terms of the \(y\) variable, IMR - so on average, we’re off by about 41 infant deaths per 1000 live births.
Generate predictions
Let’s compute \(\widehat{y}\) for each category of Polity and plot along with the actual observed value of IMR for each country.
Since Polity is categorical, we don’t know the intervals between categories, so interpreting this coefficient (-2.79) as an effect of a 1 unit change on \(y\) really doesn’t make much sense. We are assuming the effect of a change in Polity is the same across all unit changes in the Polity scale - not only are we assuming the change is the same magnitude, but that it’s in the same direction - that the effect is linear.
Another way to treat Polity in this model is to include it as a factor variable. This allows us to estimate the effect of each level of Polity on IMR, relative to a reference level. In effect, we are including a dummy variable for all but one of the cetegories of Polity. Let’s look at such a model:
Let’s generate predictions from the factor model. Each bar represents the 95% confidence interval for the predicted IMR at each level of Polity; the dots are the predictions.
A couple of things to note regarding these predictions. Because Polity is categorical, we cannot draw a continuous line across the categories. You’ll notice that it appears from the coefficients and from the predictions that the effect of Polity on IMR is not linear. The common practice of treating Polity as a continuous variable would be problematic in this model (and in most). Also, notice the ranges of the confidence intervals vary quite a lot. The variability within Polity category raises potentially interesting questions about development and regime.
Economic Development and Infant Mortality
Let’s examine how GDP per capita affects infant mortality rates, exploring different functional forms.
Looking at the predictions, it’s pretty clear the scatterplot of IMR and GDP per capita doesn’t suggest a linear relationship.
code
# Generate predictionsanalysisdata <- censor %>%mutate(xb =coef(m2)[1] +coef(m2)[2]*gdppc)pred <-predict(m2, interval="confidence", se.fit=TRUE)analysisdata <-cbind(analysisdata, pred)# # # Plot with predictions# ggplot(analysisdata, aes(x=gdppc, y=IMR)) +# geom_point(color="green") + # geom_text(label=censor$ctry, size=2) +# geom_line(aes(x=gdppc, y=xb)) +# labs(x="GDP per capita", y="Infant Mortality Rate")# same as above using highcharter highchart() %>%hc_add_series(data = censor,name ="Observed IMR",type ="scatter",hcaes(x = gdppc, y = IMR),color ="#005A43",dataLabels =list(enabled =TRUE, format ="{point.ctry}") ) %>%hc_add_series(data = analysisdata,name ="Predicted IMR",type ="line", color="#6CC24A",hcaes(x = gdppc, y = xb) ) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Infant Mortality Rate"))
Residuals
Let’s examine the residuals from this model - recall the residuals are the differences between the observed and predicted values of \(y\).
code
# Calculate residualsanalysisdata <- analysisdata %>%mutate(res = IMR - xb)# # ggplot(analysisdata, aes(x=gdppc, y=res)) +# geom_point(color="green") + # geom_text(label=censor$ctry, size=3) +# geom_abline(slope=0, intercept=0) +# labs(x="GDP per capita", y="Residuals")# same plot in highcharterhighchart() %>%hc_add_series(data = analysisdata,name ="Residuals",type ="scatter",hcaes(x = gdppc, y = res),color ="#005A43",dataLabels =list(enabled =TRUE, format ="{point.ctry}") ) %>%hc_add_series(data =list(list(x =min(analysisdata$gdppc), y =0), list(x =max(analysisdata$gdppc), y =0)),type ="line",color ="red",enableMouseTracking =FALSE,showInLegend =FALSE ) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Residuals"))
Both the predictions and the residuals suggest the relationship between GDP and IMR is not linear. Let’s consider some alternatives that might fit the data better.
Log-transformed GDP
Let’s try a log transformation of GDP per capita \(ln(GDP_{pc})\) and see if that improves the model.
and take a look at the predictions from this model:
code
# Generate predictionsanalysisdata2 <- analysisdata2 %>%mutate(xb =coef(m3)[1] +coef(m3)[2]*lngdp)#sort by lngdpanalysisdata2 <- analysisdata2[order(analysisdata2$lngdp),]# plot using highcharterhighchart() %>%hc_add_series(data = analysisdata2,name ="Observed IMR",type ="scatter",hcaes(x = gdppc, y = IMR),color ="#005A43",dataLabels =list(enabled =TRUE, format ="{point.ctry}") ) %>%hc_add_series(data = analysisdata2,name ="Predicted IMR",type ="line", color="#6CC24A",hcaes(x = gdppc, y = xb) ) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Infant Mortality Rate"))
The log-transformed model seems to fit the data better than the linear model. The residuals are also more evenly distributed around zero, though still appear correlated with GDP per capita.
code
# Calculate residualsanalysisdata2 <- analysisdata2 %>%mutate(res = IMR - xb)# Plot using highcharterhighchart() %>%hc_add_series(data = analysisdata2,name ="Residuals",type ="scatter",hcaes(x = gdppc, y = res),color ="#005A43",dataLabels =list(enabled =TRUE, format ="{point.ctry}") ) %>%hc_add_series(data =list(list(x =min(analysisdata2$gdppc), y =0), list(x =max(analysisdata2$gdppc), y =0)),type ="line",color ="red",enableMouseTracking =FALSE,showInLegend =FALSE ) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Residuals"))
Polynomials
Finally, let’s try a polynomial model to see if we can improve the fit further. We’ll try a quadratic model, \(IMR = \beta_0 + \beta_1 \times GDP_{pc} + \beta_2 \times GDP_{pc}^2\).
GDP per capita is negatively related to IMR, then positively related at higher values. Here are predictions from the polynomial model (including a 95% confidence interval) and the actual data points:
And let’s look at the residuals from the polynomial model:
code
# Calculate residualspredictions2 <- predictions2 %>%mutate(res = IMR - fit.fit)# Plot using highcharterhighchart() %>%hc_add_series(data = predictions2,name ="Residuals",type ="scatter",hcaes(x = gdppc, y = res),color ="#005A43",dataLabels =list(enabled =TRUE, format ="{point.ctry}") ) %>%hc_add_series(data =list(list(x =min(predictions2$gdppc), y =0), list(x =max(predictions2$gdppc), y =0)),type ="line",color ="red",enableMouseTracking =FALSE,showInLegend =FALSE ) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Residuals"))
The residuals exhibit less pattern than the linear model, but seem to overfit some of the nonlinearity in the sense that the residuals appear nonlinear and perhaps in the opposite direction of the observed data.
Multivariate Model: Combined Effects of Economics and Politics
Finally, let’s examine how both economic development and political system jointly affect infant mortality rates. This multivariate model is somewhat more realistic insofar as we don’t believe either of these variables acts alone in shaping infant mortality rates. Here, we’ll use the logged GDP per capita measure as that functional form seemed to fit the data best (informally).
code
# Fit multivariate modelm5 <-lm(IMR ~ lngdp + polity, data=censor)modelsummary(m4)
(1)
(Intercept)
110.123
(4.016)
gdppc
-0.015
(0.001)
gdp2
0.000
(0.000)
Num.Obs.
155
R2
0.656
R2 Adj.
0.652
AIC
1457.2
BIC
1469.3
Log.Lik.
-724.588
F
145.206
RMSE
25.94
code
# Generate predictions for different political systemsgdp_range <-seq(min(censor$gdppc), max(censor$gdppc), length.out=100)predictions_df <-data.frame(gdppc =rep(gdp_range, 2),lngdp =log(rep(gdp_range, 2)),polity =c(rep(-10, 100), rep(10, 100)))# Calculate predicted valuespredictions_df$predicted <-predict(m5, newdata=predictions_df)library(highcharter)library(dplyr)# Generate predictions for different political systemsgdp_range <-seq(min(censor$gdppc), max(censor$gdppc), length.out=100)predictions_df <-data.frame(gdppc =rep(gdp_range, 2),lngdp =log(rep(gdp_range, 2)),polity =c(rep(-10, 100), rep(10, 100)))# Calculate predicted valuespredictions_df$predicted <-predict(m5, newdata=predictions_df)# Create highcharthighchart() %>%hc_add_series(data = predictions_df[predictions_df$polity ==-10,],hcaes(x = gdppc, y = predicted),type ="line",color ="#005A43",name ="Polity = -10") %>%hc_add_series(data = predictions_df[predictions_df$polity ==10,],hcaes(x = gdppc, y = predicted),type ="line",color ="black",name ="Polity = 10") %>%hc_add_series(data = censor,hcaes(x = gdppc, y = IMR),type ="scatter",name ="Observed Values",marker =list(symbol ="circle"),color ="#6CC24A",dataLabels =list(enabled =TRUE, format ="{point.ctry}"))%>%# hc_add_series(data = censor,# hcaes(x = gdppc, y = IMR),# type = "scatter",# name = "Country Labels",# dataLabels = list(# enabled = TRUE,# format = "{point.ctry}",# style = list(fontSize = "8px")# )) %>%hc_xAxis(title =list(text ="GDP per capita")) %>%hc_yAxis(title =list(text ="Predicted Infant Mortality Rate")) %>%hc_title(text ="Predicted IMR by GDP and Political System") %>%hc_legend(title =list(text ="Polity Score")) %>%hc_tooltip(shared =FALSE)
This final plot shows how infant mortality rates are predicted to vary with GDP per capita for both highly autocratic (Polity = -10) and highly democratic (Polity = +10) countries. The actual observed values are shown as green points with country labels.